[Core][Distributed] use device group for all broadcast #5320

youkaichao · 2024-06-06T17:37:47Z

A fix for #4444.

That PR was originally tested on A100, and it showed some speedup.

However, it later appeared to slow things down on H100 and other machines. The hypothesis is that, on these high-end machines, gloo performs poorly while NVLink performs better.

Until we find a good solution, we can just use the device group for all broadcasts.
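As a rough illustration of the idea (not the actual vLLM code), the group choice could look like the sketch below. `FakeTensor` and `select_broadcast_group` are hypothetical names, and plain strings stand in for real process-group handles. `metadata_group` is the name used in the follow-up commit message for the CPU tensor case.

```python
from dataclasses import dataclass


@dataclass
class FakeTensor:
    """Stand-in for a tensor; it only records where the data lives."""
    device: str  # e.g. "cuda:0" or "cpu"


def select_broadcast_group(tensor, device_group, metadata_group):
    """Pick the process group used to broadcast `tensor`.

    Assumption: GPU tensors always go through the device group, and
    CPU tensors (which only show up in tests) fall back to the
    metadata group.
    """
    if tensor.device.startswith("cuda"):
        return device_group
    return metadata_group


# Strings stand in for real process-group handles.
gpu_choice = select_broadcast_group(FakeTensor("cuda:0"), "device", "metadata")
cpu_choice = select_broadcast_group(FakeTensor("cpu"), "device", "metadata")
```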

TODO:

Investigate whether it is possible and beneficial to communicate only CPU data, using mechanisms such as a message queue.
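A minimal sketch of what the message-queue idea might look like, under loud assumptions: threads stand in for worker processes, `queue.Queue` stands in for a real cross-process queue, and the metadata fields are made up for illustration. Nothing here reflects an existing vLLM API.

```python
import pickle
import queue
import threading


def broadcast_cpu_metadata(metadata, worker_queues):
    """Serialize once, then push the same bytes to every worker queue."""
    payload = pickle.dumps(metadata)
    for q in worker_queues:
        q.put(payload)
    return len(payload)  # payload size in bytes


def worker(q, results, rank):
    """Block until the driver publishes a payload, then decode it."""
    results[rank] = pickle.loads(q.get())


def demo(num_workers=3):
    # Hypothetical metadata; the real fields are not specified in this PR.
    metadata = {"seq_lens": [5, 7, 3], "is_prompt": True}
    queues = [queue.Queue() for _ in range(num_workers)]
    results = {}
    threads = [
        threading.Thread(target=worker, args=(q, results, rank))
        for rank, q in enumerate(queues)
    ]
    for t in threads:
        t.start()
    broadcast_cpu_metadata(metadata, queues)
    for t in threads:
        t.join()
    return metadata, results
```

The point of the sketch is that only small CPU-side metadata travels through the queue, so no gloo or device-group collective is involved for it. Whether this is actually faster is exactly what the TODO asks to investigate.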

WoosukKwon · 2024-06-06T21:32:54Z

@youkaichao Could you add some performance numbers on H100? I'm wondering how this affects the performance.

youkaichao · 2024-06-06T21:58:40Z

See https://docs.google.com/spreadsheets/d/1c9xgR0fGvm6SROfk7vrjwOZdYnKQk9oOafWK4_KgOyo/edit#gid=593626425 for reference. In particular, check the gloo and NCCL parts, around byte sizes of 1k to 2k. That is roughly the size of the data we broadcast twice.

youkaichao · 2024-06-15T18:11:46Z

Closing, as #5399 will be better.

use device group for all broadcast

5363887

fix test: for cpu tensor (only exist in tests), use metadata_group

f0d8d86

youkaichao closed this Jun 15, 2024

youkaichao deleted the change_broadcast_group branch June 15, 2024 18:11
